Skip to content

feat: dynamic egress network rule updates for running sandboxes#2074

Open
levb wants to merge 45 commits intomainfrom
lev-allow-deny-dynamic
Open

feat: dynamic egress network rule updates for running sandboxes#2074
levb wants to merge 45 commits intomainfrom
lev-allow-deny-dynamic

Conversation

@levb
Copy link
Contributor

@levb levb commented Mar 8, 2026

Summary

  • Add PUT /sandboxes/{sandboxID}/network API endpoint to dynamically update egress firewall rules (allow/deny CIDRs and domains) on running sandboxes without restart
  • Full end-to-end flow: API handler → sandbox store update → gRPC UpdateNetwork RPC → nftables rule replacement in sandbox network namespace
  • Domain-based filtering supported in allowOut via TLS SNI matching in the TCP proxy
  • Rules persist across pause/resume cycles
  • Thread-safe in-memory egress config using a single RWMutex protecting startedAt, endAt, and networkEgress
  • GetNetwork() guaranteed non-nil, eliminating nil checks in proxy and TCP firewall code paths
  • CIDRs validated before firewall reset to prevent partial rule application

Test plan

  • Integration test: dynamically allow a specific IP while denying all other traffic
  • Integration test: clear all rules and verify traffic flows freely
  • Integration test: replace rules (PUT semantics, not additive)
  • Integration test: 404 for nonexistent sandbox
  • Integration test: 401 without API key
  • Integration test: rules survive pause/resume cycle
  • Integration test: multiple sequential updates each change actual connectivity
  • Integration test: domain allow/remove via TLS SNI

levb and others added 10 commits March 8, 2026 10:04
Add API endpoint to dynamically update sandbox egress network rules on
a running sandbox. The handler updates the API sandbox store atomically;
the orchestrator gRPC call is stubbed for now (stage 2).

Includes integration tests for success, not-found, unauthorized, clear
rules, replace rules, and multiple sequential updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Found

The previous ID "nonexistent-sandbox-id" had too many hyphens, failing
ID validation (400) before reaching the not-found path (404).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add full gRPC flow for runtime sandbox network config updates:
- Proto: SandboxUpdateNetworkRequest + UpdateNetwork RPC
- Orchestrator: UpdateNetwork handler updates nftables (non-TCP) and
  in-memory egress config (TCP proxy) with RWMutex thread safety
- API: normalize bare IPs to CIDR notation before gRPC call
- Slot.UpdateInternet: reset + re-apply firewall rules in sandbox netns
- Thread-safe networkEgress field on Metadata (same pattern as endAt)
- Integration tests verify actual connectivity changes (not just 204s)
- test-network-update Makefile target for convenience

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dedicated test target

Remove unused *sandbox.Sandbox return value from UpdateSandboxNetworkConfig
since no caller uses it. Remove dedicated test-network-update Makefile target
in favor of the generic test/% pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The creation path splits AllowOut entries into CIDRs and domains via
ParseAddressesAndDomains, but the update path only handled CIDRs. Now
the update path mirrors the creation path: domains are extracted, the
default nameserver is added when domains are present, and domains are
passed through gRPC to the orchestrator's in-memory config for the TCP
proxy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that network egress rules set via PUT /sandboxes/{id}/network
survive a pause/resume cycle — the resumed sandbox should still enforce
the same allow/deny rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verify that dynamically adding a domain to AllowOut (via TLS SNI matching
in the TCP proxy) works, and that removing the domain re-blocks traffic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate API docs

Consolidate startedAtMu/endAtMu/networkEgressMu into single rwmu.
GetNetwork() now always returns non-nil with non-nil Ingress, removing
nil checks in proxy.go and tcpfirewall. Update openapi.yml allowOut/denyOut
descriptions to document domain support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se nil-safe getters

Validate all CIDRs upfront in UpdateInternet to prevent partial firewall
state if an invalid CIDR is passed after rules have been reset. Add
missing 409 Conflict response to the PUT /network OpenAPI spec. Use
GetIngress() nil-safe getter in proxy.go for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
levb and others added 7 commits March 8, 2026 11:31
After refactoring GetNetwork() to read egress from the mutex-protected
networkEgress field, tests that construct Metadata directly must call
SetNetworkEgress() to mirror what the constructor does.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the sandbox store mutation after the gRPC call to the orchestrator
node. Previously the store was updated first, so a failed gRPC call
left the store with the new config while the sandbox still enforced the
old rules. Now: Get (read-only) → gRPC → Update (persist).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…at-a-time functions

All firewall set mutations (ConfigureInternet, UpdateInternet, ResetInternet,
and the constructor) now go through ReplaceUserRules which buffers all 4 set
replacements and issues a single atomic nftables Flush. Removes AddAllowedCIDR,
AddDeniedCIDR, Reset, ResetDeniedSets, ResetAllowedSets, and addCIDRToSet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The firewall_toolkit's ValidateAddress rejects 0.0.0.0 as "unspecified",
which broke sandbox creation with internetAccess=false after the atomic
ReplaceUserRules refactoring. The old addCIDRToSet bypassed the toolkit
for this case by creating raw nftables elements directly; restore that
bypass in clearAndReplaceCIDRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@levb levb marked this pull request as ready for review March 8, 2026 20:21
@levb levb assigned dobrac and unassigned ValentaTomas Mar 8, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a95c13a2d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

levb and others added 2 commits March 9, 2026 09:49
Match the create path's validation in the PUT /sandboxes/{id}/network
handler: reject non-IP/CIDR values in denyOut (400 instead of 500),
and require 0.0.0.0/0 in denyOut when allowOut contains domains.
Add integration tests for both cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… table-driven suite

Single sandbox, sequential subtests covering: 8 rejected inputs, 10 accepted inputs,
12 firewall rule steps (deny-all, allow precedence, replace semantics, domain/IP/CIDR,
clear/reapply, allow-without-deny), and pause/resume rule preservation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
levb and others added 4 commits March 9, 2026 12:29
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tional apply/revert

Combine the separate UpdateNetwork and Update gRPC RPCs into a single
Update RPC that handles both end_time and egress changes. Updates are
applied sequentially using a functional updater pattern — on failure,
already-applied changes are rolled back in reverse order.

Publish a SandboxUpdatedEvent for any successful update, including
network egress details (allowed/denied CIDRs, allowed domains) in
the event data alongside the existing set_timeout field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The networkEgress field is set explicitly by the Update handler after
UpdateInternet succeeds. Initializing it in ResumeSandbox was not
needed by any change in this PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@levb levb requested a review from dobrac March 10, 2026 00:01
levb and others added 2 commits March 9, 2026 18:36
The previous commit (ed87f50) incorrectly removed networkEgress
initialization from ResumeSandbox, assuming it was only needed for the
dynamic update path. However, the API's Create handler also calls
ResumeSandbox, so the egress config was never populated — causing the
TCP proxy to see nil egress and allow all traffic through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
levb and others added 5 commits March 10, 2026 15:18
Extract buildEgressConfig to deduplicate ParseAddressesAndDomains +
nameserver + CIDR conversion logic between create and update paths.
Add idna.Lookup.ToASCII validation for domain entries in allowOut.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ates

Rollbacks use context.WithoutCancel to ensure they complete even if the
request context is canceled. Replaces manual rollback loop in Update RPC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single source of truth for network settings — no more duplicate
networkEgress field. Config.network is private with GetNetwork,
SetNetworkEgress, and SetNetwork methods protected by Config's
own RWMutex. Config is now *Config to avoid copylocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip *. prefix before idna.Lookup.ToASCII validation since
the wildcard is our own syntax, not a DNS label. Fixes
integration test for *.example.com with deny all.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename function and file for clarity. Initialize Config in test
sandbox (was nil, causing panic). Fix egress assertion to match
GetNetwork() non-nil defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@levb levb requested a review from dobrac March 11, 2026 00:12
levb and others added 8 commits March 12, 2026 03:37
…IDRs

FlushSet only buffers a "clear set" netlink message — it does NOT
commit to the kernel. Clarify this with comments and remove the
redundant top-level FlushSet (ClearAndAddElements does its own).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfig

Replace direct &sandbox.Config{} + SetNetwork() pattern with
sandbox.NewConfig(sandbox.Config{...}) constructor that normalizes nil
Network to an empty proto and initializes the RWMutex. This removes
nil-guard boilerplate from GetNetwork() and prevents nil-pointer panics
when accessing network sub-fields via proto getters.

Also updates proxy.go to use proto nil-safe GetTrafficAccessToken()
instead of direct field access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Main's f9e23eb changed Map.Insert/Get signatures (requiring Slot and
MarkRunning), which broke test helpers in this file. Rather than
fixing all the boilerplate, remove the tests that are redundant with
integration tests (NotFound, EndTimeOnly, NoFieldsSet) and keep only
the two rollback tests that exercise failure paths not covered by
integration: egress failure must not change end_time, and a combined
end_time+egress update must revert end_time when egress fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GetNetwork() returned the proto pointer after releasing the read lock,
allowing concurrent reads of Egress/Ingress to race with SetNetworkEgress
writes. Replace GetNetwork() with GetNetworkEgress/GetNetworkIngress/
SetNetworkEgress that hold the lock while accessing sub-fields. Remove
unused SetNetwork method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@levb levb requested a review from dobrac March 12, 2026 13:31
@ValentaTomas ValentaTomas removed their request for review March 12, 2026 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants